Goals:
Expose you to some foundational techniques & vocab for analyzing text. There is a focus on “bag of words” techniques rather than diving into deep learning techniques (foundations first). Brevity was valued over agonizing detail.
# data manipulation
library(dplyr)
# text stuff
library(tidytext)
library(SnowballC)
library(textstem)
library(textdata)
# visuals
library(wordcloud2)
library(ggplot2)
We’ll be leveraging the tidytext package due to its
approachability and focus on the data.frame as the main
object type for analysis.
The package has an accompanying book for free online here: https://www.tidytextmining.com/.
The data being used is every line in the script from the TV show: The Office. The data was sourced from this post on Reddit, and can be found directly in this google sheet.
First vocab words!!
office <- read.csv("the_office_script.csv")
dim(office)
## [1] 59909 7
names(office)
## [1] "id" "season" "episode" "scene" "line_text" "speaker"
## [7] "deleted"
Many traditional methods for text analytics treat text as an unordered collection of words. This leads to many techniques that boil down to counting how many times different words occur. Many more modern methods take more advantage of neural networks and word embeddings to focus more on a word’s context.
However, bag of words techniques are still worthwhile: they are simple, fast, and easy to interpret.
The unit of analysis in a bag of words approach is a single word or a “token”.
tidytext’s approach to this is the
unnest_tokens function. It will break up each
document into its individual tokens while still keeping useful
ID information in the rows.
office_tokens <- office %>%
unnest_tokens(word, line_text)
office_tokens %>%
select(speaker, word) %>%
head()
## speaker word
## 1 Michael all
## 2 Michael right
## 3 Michael jim
## 4 Michael your
## 5 Michael quarterlies
## 6 Michael look
We can now start to do different counting tasks using typical R
data.frame methods.
The most commonly occurring words are shown below. Ahh, the insights!
top_tokens <- office_tokens %>%
group_by(word) %>%
summarise(count = n()) %>%
arrange(-count)
top_tokens %>%
head()
## # A tibble: 6 × 2
## word count
## <chr> <int>
## 1 i 23599
## 2 you 22173
## 3 the 17722
## 4 to 16637
## 5 a 15218
## 6 and 11158
In the previous section we analyzed the most common words, and it turns
out the most common words are boring… Luckily we’re not the first ones
to come across this issue. These are referred to as
stop words, and there are different lists we can use to remove
them. The tidytext package provides a
stop_words data.frame that holds 3 separate
lists; we can either choose one or just use them all.
# stop words per list
stop_words %>%
group_by(lexicon) %>%
summarise(count = n()) %>%
arrange(-count)
## # A tibble: 3 × 2
## lexicon count
## <chr> <int>
## 1 SMART 571
## 2 onix 404
## 3 snowball 174
# example of some stopwords
stop_words[sample(nrow(stop_words), size = 8), ]
## # A tibble: 8 × 2
## word lexicon
## <chr> <chr>
## 1 said SMART
## 2 very onix
## 3 went SMART
## 4 contains SMART
## 5 far onix
## 6 backed onix
## 7 while onix
## 8 seemed SMART
To remove these words from our tokens we can do some filtering.
filtered_top_tokens <- top_tokens %>%
anti_join(stop_words, by = "word")
paste(nrow(top_tokens) - nrow(filtered_top_tokens), "stop words removed")
## [1] "675 stop words removed"
wordcloud2(filtered_top_tokens)
There are still some words that you might not find too valuable for this analysis. Keep in mind that you aren’t limited to using the words in a pre-defined list. There might be times where building an industry-specific set of stop words makes sense.
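One minimal sketch of that idea, continuing with the top_tokens table from above (the custom words here are just hypothetical examples, not a recommendation):

```r
library(dplyr)
library(tidytext)

# hypothetical domain-specific stop words for this corpus
custom_stops <- data.frame(
  word = c("yeah", "hey", "uh", "um"),
  lexicon = "custom"
)

# combine with tidytext's built-in lists, then filter as before
all_stops <- bind_rows(stop_words, custom_stops)

top_tokens %>%
  anti_join(all_stops, by = "word")
```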
Practice tokenizing text and data manipulation by:
Practice stop word removal by:
Sometimes we don’t want to differentiate between different tenses/usages of the same word; for example, below the word “stop” is used in 4 different ways. Depending on the insights you want to find, you might want to keep these separate or you might want to combine them into just one “stop” category.
Chopping off the endings of these words will leave you with the stem of “stop”.
stops <- top_tokens %>%
filter(grepl("stop", word)) %>%
head(4)
stops
## # A tibble: 4 × 2
## word count
## <chr> <int>
## 1 stop 614
## 2 stops 45
## 3 stopped 35
## 4 stopping 20
Below we apply the wordStem() function from the
SnowballC package. This function leaves us with just the
stem of each word.
stops %>%
mutate(stem = wordStem(word))
## # A tibble: 4 × 3
## word count stem
## <chr> <int> <chr>
## 1 stop 614 stop
## 2 stops 45 stop
## 3 stopped 35 stop
## 4 stopping 20 stop
If we repeat the same style word counting analysis now we might get some different results (might not… :shrug:).
In the below cell the complete analysis is restarted from scratch.
Tokenize -> remove stop words -> stem -> aggregate. I’ve added
the example_word column that shows what the word might have
looked like before stemming (just 1 of the many potential words). In
some cases it can be tough to see what the word was before stemming (e.g.
see “hei” below, which was originally “hey”, or “gui”, which was originally
“guy”).
office %>%
unnest_tokens(word, line_text) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word)) %>%
group_by(stem) %>%
summarise(count = n(), example_word = first(word)) %>%
arrange(-count)
## # A tibble: 14,949 × 3
## stem count example_word
## <chr> <int> <chr>
## 1 yeah 3227 yeah
## 2 michael 2657 michael
## 3 hei 2420 hey
## 4 dwight 2144 dwight
## 5 jim 1911 jim
## 6 pam 1629 pam
## 7 uh 1624 uh
## 8 gui 1616 guys
## 9 gonna 1555 gonna
## 10 time 1496 times
## # … with 14,939 more rows
Word stemming is a rule-based approach to finding word stems: remove “s” from the end of a word, remove “ing” from the end of a word, etc. This has the pro of being flexible to unique words that might not even appear on urbandictionary. It has the con of missing words that don’t follow the typical rules for changing tense.
The two cases below illustrate this point more directly.
stops <- c("stops", "stopping", "stopped")
swims <- c("swim", "swam", "swum")
Word stemming can do well when removing endings is the right move, but it will fail on words that follow patterns that don’t concern the word ending.
wordStem(stops)
## [1] "stop" "stop" "stop"
wordStem(swims)
## [1] "swim" "swam" "swum"
A different but like-minded process for finding the root of a token is “lemmatization”: we want to find the “lemma” of each input word. This is shown below to work on both of our example cases. Lemmatization has the opposite pros/cons of stemming: it relies on a dictionary-based approach, so if there is unusual vocab in your corpus it might not be very effective.
lemmatize_words(stops)
## [1] "stop" "stop" "stop"
lemmatize_words(swims)
## [1] "swim" "swim" "swim"
PS no one is stopping you from doing both stemming and lemmatization. If you apply both, definitely apply lemmatization before stemming. Lemmatization always outputs a valid English word; stemming can make some words unrecognizable (which would hurt lemmatizing).
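A small sketch of that ordering, reusing our example words:

```r
library(textstem)
library(SnowballC)

words <- c("stopped", "swam", "stopping")

# lemmatize first (dictionary-based), then stem the resulting lemmas
wordStem(lemmatize_words(words))
```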
“Sentiment analysis” is a term used when we try to describe text based on the feelings it expresses. This can be “positive” vs “negative”, or it can get into more nuanced sentiments like “anger”, “fear”, and “trust”.
One way to do sentiment analysis is with a dictionary based approach. This has the typical limitations of dictionary based approaches: it only works if the word is in your dictionary.
The tidytext package provides routes to multiple
sentiment dictionaries:
“afinn” - provides a word and a numeric score. The score’s sign (positive or negative) indicates the direction of the sentiment; the score’s magnitude indicates its strength
“bing” - provides a word and a label of the word as positive or negative
“nrc” - provides a word and a label of the word’s sentiment, drawn from a diverse set of sentiment labels (i.e. sadness, anger, etc.)
“loughran” - provides a word and a label of the word’s sentiment. Developed from financial reports, so this is a powerful or useless dictionary depending on the context: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”
Example using afinn
afinn <- get_sentiments("afinn")
sum_sentiment_by_char <- office_tokens %>%
left_join(afinn, by = "word") %>%
group_by(speaker) %>%
summarise(sentiment = sum(value, na.rm = TRUE))
# Most negative speakers
sum_sentiment_by_char %>%
arrange(sentiment) %>%
head(3)
## # A tibble: 3 × 2
## speaker sentiment
## <chr> <dbl>
## 1 Vance Refrigeration Worker #2 -35
## 2 Devon -18
## 3 Teddy -14
# Most positive speakers
sum_sentiment_by_char %>%
arrange(-sentiment) %>%
head()
## # A tibble: 6 × 2
## speaker sentiment
## <chr> <dbl>
## 1 Michael 8869
## 2 Jim 3962
## 3 Pam 3420
## 4 Andy 2559
## 5 Dwight 2444
## 6 Erin 820
Example using nrc
nrc <- get_sentiments("nrc")
sum_sentiment_by_char <- office_tokens %>%
left_join(nrc, by = "word") %>%
anti_join(stop_words, by = "word") %>%
filter(!is.na(sentiment)) %>%
group_by(speaker, sentiment) %>%
summarise(count = n(), example_word = first(word)) %>%
arrange(-count)
## `summarise()` has grouped output by 'speaker'. You can override using the
## `.groups` argument.
# Most common speaker x sentiment pairs
sum_sentiment_by_char %>%
head()
## # A tibble: 6 × 4
## # Groups: speaker [2]
## speaker sentiment count example_word
## <chr> <chr> <int> <chr>
## 1 Michael positive 6092 library
## 2 Michael negative 4151 mistake
## 3 Michael trust 3352 guidance
## 4 Michael anticipation 3346 deal
## 5 Dwight positive 3318 word
## 6 Michael joy 3276 deal
A limitation of this dictionary approach is that it only considers a single word at a time. For example, “good” can express positive sentiment, but what if I had “not” in front of it, or “very”? Some methods try to address this by negating or boosting the score of the following word. However, it turns out that’s still not enough, since language can be very nuanced; for example, “I just got over having the flu, what a great experience that was…”
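As a sketch of the simple negation idea (before it, too, falls short), we can tokenize into bigrams and flip the afinn score of any word preceded by a negator. This uses the n-gram tokenization that unnest_tokens also supports; the negator list here is just an illustrative assumption:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# illustrative (not exhaustive) list of negating words
negators <- c("not", "no", "never", "don't")

office %>%
  # tokenize into two-word sequences instead of single words
  unnest_tokens(bigram, line_text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%
  # flip the score when the preceding word negates it
  mutate(value = ifelse(word1 %in% negators, -value, value))
```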
Due to these limitations, more modern/cutting edge sentiment methods utilize deep learning. This isn’t to say that the dictionary methods are totally useless. If you know your corpus and have some expectations of the results (and follow-up analysis on results) you’ll be able to tell if the dictionary approach isn’t cutting it for your application.
Practice sentiment analysis by:
Let’s first revisit the motivation using stop words. We like removing stop words because they are so common that they don’t provide value, but general stop word lists might not cut it. For example, in the world of The Office they work at a paper company, so maybe we want to consider “paper” a stop word.
Instead of always building a custom stop word dictionary, we can try to put a number on the value of a word. This is the goal of TF-IDF.
TF stands for “term frequency” - it’s a measure of how often a word appears in a document. IDF stands for “inverse document frequency” - it’s a measure of how many documents a word appears in.
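As a toy sketch of the arithmetic (using made-up counts, not the Office data): tf is a word’s share of its document’s tokens, idf is log(total documents / documents containing the word), and tf-idf is their product.

```r
library(dplyr)

# toy corpus: word counts for two "documents"
counts <- data.frame(
  doc  = c("A", "A", "B"),
  word = c("paper", "party", "paper"),
  n    = c(4, 1, 5)
)

counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                  # share of each document's tokens
  group_by(word) %>%
  mutate(idf = log(2 / n_distinct(doc))) %>%   # 2 documents in total
  ungroup() %>%
  mutate(tf_idf = tf * idf)
# "paper" appears in every document, so its idf (and tf-idf) is 0
```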
The below analysis removes stop words and then calculates the tf-idf of words for each Office character. It turns out to be a way to find the people each character talks about the most (and often doubles as a per-person love-interest finder).
# treat each speaker as a "document"
# what are the highest tf-idf words per speaker
tfidf_by_speaker <- office_tokens %>%
anti_join(stop_words, by = "word") %>%
group_by(word, speaker) %>%
summarise(n = n()) %>%
bind_tf_idf(word, speaker, n) %>%
arrange(-tf_idf)
## `summarise()` has grouped output by 'word'. You can override using the
## `.groups` argument.
tfidf_by_speaker %>%
filter(n > 50) %>%
head(8)
## # A tibble: 8 × 6
## # Groups: word [6]
## word speaker n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 michael Jan 204 0.0720 1.76 0.126
## 2 michael David 75 0.0603 1.76 0.106
## 3 andy Erin 126 0.0265 2.43 0.0643
## 4 michael Holly 52 0.0326 1.76 0.0573
## 5 ryan Kelly 66 0.0197 2.83 0.0557
## 6 god Kelly 61 0.0182 2.47 0.0451
## 7 uh Toby 80 0.0255 1.76 0.0450
## 8 dwight Angela 112 0.0221 2.03 0.0448
Practice TF-IDF analysis by:
tidytext and ggplot2 packages